We learned the big 5 data transformation functions in the dplyr package:
| function | purpose |
|---|---|
filter() |
Pick observations by their values |
arrange() |
Reorder the rows |
select() |
Pick variables by their names |
mutate() |
Create new variables |
summarise() |
Create summaries |
And we used these functions to create this plot:
By the end of this session, you should be able to:
tidyr package, a member of the tidyverseCtrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspaceIs your project still open? If not, click on the project icon to load it. (Don’t create a new one.)
Run the following code in your console. Change products to your preferred subfolder.
download.file("https://tinyurl.com/y7tmv7uh",
destfile = "products/lab-w05.Rmd")Tools > Global Options
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham 2014)
Wickham also uses tidy as a verb meaning “to structure a dataset to facilitate analysis”.
How many variables are in this dataset? This table and the following from Wickham (2014).
Columns are variables and rows are observations (i.e., combination of religion and income):
How would you link up with population data by country and age group to calculate rates? (answer: after tidying)
The element column is not a variable with values, but rather a vector of variable names.
Every row becomes an observation (i.e., weather station by date) with two measurements:
This style of data entry and storage invites errors and inconsistencies:
| function | purpose |
|---|---|
gather() |
Gather variable values spread across multiple columns |
spread() |
Spread out observation values scattered across rows |
separate() |
Split one column into two or more columns |
unite() |
Collapse multiple columns into one column |
Load the tidyverse package and look at the included dataset called table4a.
library(tidyverse)
table4a## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
What is messy about this tibble?
gather()1999 and 2000 are values, not variables. The variable should be year.
Examples and figures from Wickham and Grolemund (2017)
gather() table4a %>% # remember pipes?
gather(`1999`, `2000`, key = "year", value = "cases")## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
We gather values by a set of column names that define the key, which we name year. We then store the values in a new variable called cases.
gather()We get the same result excluding country with a - sign. Note that we’ve also removed the parameter names key and value. This is possible because the default for gather() expects you to specify key before value. See ?gather.
table4a %>% # remember pipes?
gather("year", "cases", -country)## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
gather()Try gathering table4b (also included with tidyverse) to create variables for year and population values (call it “population”).
table4b## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
spread()Now look at table2. type is not a variable!
table2## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
spread()Let’s use spread() to make two proper variables: cases and population.
spread()The key in this example is the type column, and the values we want to spread are stored in count.
table2## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
spread()Just like with gather(), we’ll pass two arguments to spread() in addition to an object: key and value.
spread(table2, key = type, value = count)## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
spread out wide, or gather (stack) into a long (tall) pileThe separate() and unite() functions split and combine columns.
separate()Here we want split rate into two columns: cases and population
separate()It’s simple. Just tell R the column to split and the columns to create:
table3 %>%
separate(col = rate, into = c("cases", "population"))## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()separate() is smart enough to know to split on “/”, but you can also specify with sep = "/". The convert==TRUE argument will guess at the data type rather than leave the new columns characters.
table3 %>%
separate(col = rate, into = c("cases", "population"),
sep="/", convert = TRUE)## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()Can also separate on a specific character position with the sep argument. For example, we separate year into century and year by specifying sep==2. Count from left (1, 2, …) or from right (-1, -2, …).
table3 %>%
separate(col = year, into = c("century", "year"), sep=2)## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite()unite() does the opposite: it combines two or more columns into one.
unite()In this example, new is the name of the new column we want to create by combining columns century and year.
table5 %>%
unite(col = new, century, year)## # A tibble: 6 x 3
## country new rate
## * <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
unite()By default unite() will add an underscore, but adding sep="" (no space) will combine without the _.
table5 %>%
unite(col = new, century, year, sep = "")## # A tibble: 6 x 3
## country new rate
## * <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
One last point to consider is how to handle missing values when transforming data. Let’s make a tibble called stocks with 2 years of quarterly data on returns.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)There are two types of missing data:
complete() returns all combinations of a set of columns and will fill in NA when combinations are missing.
stocks %>%
complete(year, qtr)## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
Now we want to use the disclosure data from the first wave to create a participant characteristics table:
| variable | disclosed | non-disclosed |
|---|---|---|
| c.age | 12.3 | 11.7 |
| c.female | 50.8 | 51.6 |
| p.age | 47.1 | 49.0 |
| p.female | 86.8 | 89.3 |
If you completed the Week 4 tutorial, you should already have the data file you need for today. If not, run the download chunk in your template.
Once you have the data (a csv file), import it into R and assign to an object called dat.
dat <- read.csv("input/r2d2-w04.csv", stringsAsFactors = FALSE)Let’s start by remembering how to summarize our data. Use the dplyr package to get the means of c.age.r1 and p.age.r1. What functions can you use?
You can use summarise() and mean() to get the means of c.age.r1 and p.age.r1.
summarise(dat, c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE))## c.age p.age
## 1 12.12366 47.75269
dat %>%
# notice we omit the object name dat since piped
summarise(c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE))## c.age p.age
## 1 12.12366 47.75269
We want to compare participant characteristics by the disclosure status of the child at baseline. The relevant variable is doesKnow16.r1. What can you add before summarise() to create this indicator variable? Use table() to remind yourself how this variable was coded.
table(dat$doesKnow16.r1)##
## child might know or suspect no, child does not know
## 19 92
## yes, child knows
## 250
dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ 1,
TRUE ~ 0
)) %>%
# only including here to show the new variable
select(p.disclosed) %>%
glimpse() ## Observations: 372
## Variables: 1
## $ p.disclosed <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,...
Add something after mutate() to prepare to summarize the data.
The group_by() function can be used with summarise() to tell R to perform the summary by disclosure status.
dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ "disclosed",
TRUE ~ "non-disclosed"
)) %>%
group_by(p.disclosed) %>%
summarise(c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE))## # A tibble: 2 x 3
## p.disclosed c.age p.age
## <chr> <dbl> <dbl>
## 1 disclosed 12.33200 47.144
## 2 non-disclosed 11.69672 49.000
Use names() to remind yourself of the variables in dat. You can skip the “doesKnow” variables.
names(dat)## [1] "c.age.r1" "c.female.r1" "p.age.r1" "p.female.r1"
## [5] "doesKnow12.r1" "doesKnow15.r1" "doesKnow16.r1" "doesKnow19.r1"
## [9] "doesKnow20.r1" "doesKnow23.r1" "doesKnow24.r1" "doesKnow27.r1"
dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ "disclosed",
TRUE ~ "non-disclosed"
)) %>%
group_by(p.disclosed) %>%
summarise(c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE),
c.female=mean(c.female.r1, na.rm=TRUE),
p.female=mean(p.female.r1, na.rm=TRUE))## # A tibble: 2 x 5
## p.disclosed c.age p.age c.female p.female
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 disclosed 12.33200 47.144 0.5080000 0.8680000
## 2 non-disclosed 11.69672 49.000 0.5163934 0.8934426
Formatting aside, does this tibble resemble a table of participant characteristics you might see in an article?
## # A tibble: 2 x 5
## p.disclosed c.age p.age c.female p.female
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 disclosed 12.33200 47.144 0.5080000 0.8680000
## 2 non-disclosed 11.69672 49.000 0.5163934 0.8934426
What functions from tidyr can help us get there?
gather() dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ "disclosed",
TRUE ~ "non-disclosed"
)) %>%
group_by(p.disclosed) %>%
summarise(c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE),
c.female=mean(c.female.r1, na.rm=TRUE),
p.female=mean(p.female.r1, na.rm=TRUE)) %>%
gather(., key="variable", value="value", -p.disclosed)## # A tibble: 8 x 3
## p.disclosed variable value
## <chr> <chr> <dbl>
## 1 disclosed c.age 12.3320000
## 2 non-disclosed c.age 11.6967213
## 3 disclosed p.age 47.1440000
## 4 non-disclosed p.age 49.0000000
## 5 disclosed c.female 0.5080000
## 6 non-disclosed c.female 0.5163934
## 7 disclosed p.female 0.8680000
## 8 non-disclosed p.female 0.8934426
## # A tibble: 8 x 3
## p.disclosed variable value
## <chr> <chr> <dbl>
## 1 disclosed c.age 12.3320000
## 2 non-disclosed c.age 11.6967213
## 3 disclosed p.age 47.1440000
## 4 non-disclosed p.age 49.0000000
## 5 disclosed c.female 0.5080000
## 6 non-disclosed c.female 0.5163934
## 7 disclosed p.female 0.8680000
## 8 non-disclosed p.female 0.8934426
What do we need to do?
spread() dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ "disclosed",
TRUE ~ "non-disclosed"
)) %>%
group_by(p.disclosed) %>%
summarise(c.age=mean(c.age.r1, na.rm=TRUE),
p.age=mean(p.age.r1, na.rm=TRUE),
c.female=mean(c.female.r1, na.rm=TRUE),
p.female=mean(p.female.r1, na.rm=TRUE)) %>%
gather(., key="variable", value="value", -p.disclosed) %>%
spread(., key=p.disclosed, value=value)## # A tibble: 4 x 3
## variable disclosed `non-disclosed`
## * <chr> <dbl> <dbl>
## 1 c.age 12.332 11.6967213
## 2 c.female 0.508 0.5163934
## 3 p.age 47.144 49.0000000
## 4 p.female 0.868 0.8934426
dat %>%
mutate(p.disclosed = case_when(
doesKnow16.r1=="yes, child knows" ~ "disclosed",
TRUE ~ "non-disclosed"
)) %>%
group_by(p.disclosed) %>%
summarise(c.age=round(mean(c.age.r1, na.rm=TRUE), 1),
p.age=round(mean(p.age.r1, na.rm=TRUE), 1),
c.female=round(mean(c.female.r1, na.rm=TRUE)*100, 1),
p.female=round(mean(p.female.r1, na.rm=TRUE)*100, 1)) %>%
gather(., key="variable", value="value", -p.disclosed) %>%
spread(., key=p.disclosed, value=value) %>%
kable()| variable | disclosed | non-disclosed |
|---|---|---|
| c.age | 12.3 | 11.7 |
| c.female | 50.8 | 51.6 |
| p.age | 47.1 | 49.0 |
| p.female | 86.8 | 89.3 |
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.